Fix #274 - Views Incorrectly Encode UTF Characters as HTML Entities #403

GoryMoon · 2023-11-26T23:30:55Z

Description

See the issue #274 for more details about the issue.

I identified that when the incorrectly encoded characters showed up the to_html function didn't have the $toHTML global array so instead used the htmlentities function causing the html encodings.

SuiteCRM-Core/public/legacy/include/utils/db_utils.php

Lines 99 to 102 in 117dd81

 if (is_array($toHTML)) { 

 $string = str_ireplace($GLOBALS['toHTML_keys'], $GLOBALS['toHTML_values'] ?? [], $string); 

 } else { 

 $string = htmlentities($string, ENT_HTML401|ENT_QUOTES, 'UTF-8');

~~With the small change of defining the variable in the $GLOBALS array, the function will use that and encode it correctly.~~

~~There might be more related to encoding issues but this fixes my issues and what it looks like the issue is having.~~

See this comment for an update on the changes to this: #403 (comment)

Motivation and Context

How To Test This

Change the name of something that is referenced.
a. For example, change the company name that is referenced in a Call. (Like adding åäö or other letters)
Got the the list of that data and you should see the HTML encoded letters

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Final checklist

My code follows the code style of this project found here.
My change requires a change to the documentation.
I have read the How to Contribute guidelines.

serhiisamko091184 · 2023-11-27T15:12:21Z

Hello @GoryMoon,
thank you very much for your contribution!

Would you be so kind to check with me replication steps:

Initially I've created an account:

After that I created a call to make a relationship with the account:
I've applied the current fix and the letters "åäö" are encoded properly. At the same time single quotes are rendered without appropriate encoding:

Is this something you could have a look at if you have time (characters ', '', <, >)?

Thanks in advance!

Regards,
Serhii

pgorod · 2023-11-27T16:01:54Z

I don't know if anybody read my comment in the Issue...

I'd really like to understand this better. The question "how is this in the Database" is an important one. If there's HTML encoding in the database, it's wrong. If we don't put it there in the first-place, it's better.

I also don't think that we should be re-implementing basic stuff that PHP is supposed to be able to handle correctly, like string substitutions to encode HTML. Otherwise, now you fixed four cases, then @serhiisamko091184 asks you for 4 more, and we can go on indefinitely but we should just use complete functions from the existing libraries.

GoryMoon · 2023-11-27T23:35:06Z

I have done some debugging to get a better understanding of when the encoding issues occur and why it works in some places. This is just a collection of stuff I found related to this issue.
I did this testing while forcing the to_html() to use htmlentities() instead of the string replace one.

I'd really like to understand this better. The question "how is this in the Database" is an important one. If there's HTML encoding in the database, it's wrong. If we don't put it there in the first-place, it's better.

From testing it looks like it saves it like normal and not HTML encoded, when saving in cleans up the input and removes some stuff if it's not allowed. When trying to insert <script>alert(1);</script> in a company name it will remove the whole script.
After manually inserting the above script into the database it will show the text but don't execute it. As Angular just treats it as text when inserting it into the DOM, to make it run it needs to be like a html-field and not like for example like the relate-field currently is.

The encoding error happens on pages where the GraphQL returns the data HTML encoded and angular handles the data. For pages when it works, it's either not encoded when angular receives it or it's a legacy iframe that contains the HTML-encoded values directly.

As an example: on the Account page the records are received from the RecordHandler which tells the Bean not to encode the data:

SuiteCRM-Core/core/backend/Data/LegacyHandler/RecordHandler.php

Line 151 in 117dd81

$bean = BeanFactory::getBean($moduleName, $id, ['encode' => false]);

Most issues where you have a Related To field exist because the underlying list fetching data encodes the parent data, and the normal record data is decoded.

As an example this is from the Calls list the name of the record is correct but the Related To is encoded

The record is correct because it goes through this function and maps the fields but not the parent_name field containing the related name.

SuiteCRM-Core/public/legacy/include/portability/ApiBeanMapper/ApiBeanMapper.php

Line 458 in 117dd81

$arr[$name] = html_entity_decode($bean->$field ?? '', ENT_QUOTES);

Another place would be the timeline where it could be called to not encode the data from the backend code like the Account records.

A encode parameter could be added to reach the fetchAll functions until you get to the SubpanelCustomQueryPort that runs the fetchByAssoc function that takes the encode parameter.

SuiteCRM-Core/core/backend/Data/LegacyHandler/PresetDataHandlers/HistoryTimelineDataHandler.php

Line 283 in 117dd81

$listData = $this->fetchAll($combinedQuery);

I also don't think that we should be re-implementing basic stuff that PHP is supposed to be able to handle correctly, like string substitutions to encode HTML. Otherwise, now you have fixed four cases, then serhiisamko091184 asks you for 4 more, and we can go on indefinitely but we should just use complete functions from the existing libraries.

I agree with using existing functions rather than reimplementing them.
I didn't know how much would be needed to fix this but when I found that the function didn't work as intended this could ba e small fix the solve part of the annoyance of encoding utf8 characters.

SuiteBot · 2024-08-27T14:25:13Z

All committers have signed the CLA.

GoryMoon · 2024-08-27T14:46:47Z

I've looked over this once again after discovering that a change in version 8.6.1 caused descriptions and other fields to have the wrong encoding.
This was the change that caused it:
f483bec#diff-033d0d7079b9ca4c1ff8658dc33517d717937747e7ee218622dae6abb0173273R60-R65

I've tested adding the string that @yunusyerli1 was using in #501 (Şebnem Çakğır šččřž 안녕하세요감사합니다ΚαλημέραΕυχαριστώสวัสดีनमस्ते) to a bunch of places. Name, description, address and more.

Compared to the previous change this is a bit bigger.
The main cause for the encoding issue was the use of htmlentities, by replacing it with htmlspecialchars only characters that have a special significance in HTML is encoded leaving UTF-8 characters.
I also removed the "re-implemented" encoding/decoding that @pgorod pointed out and I agree with.

There a more htmlentities in the code but I haven't looked up how they get called as this change is solves the locations I've noticed.

…L Entities

GoryMoon · 2024-08-30T12:41:54Z

Even after the previous changes quotes " ' and < > was encoded, by adding the html_entity_decode to the TextMapper this is fixed.

In some specific cases it double encoded some characters so that's why there are two html_entity_decode.

This mapper is used for the api and text fields so html code shouldn't be a problem. I've tested with injecting this <script>alert(1);</script> in the database (it's cleaned by the purify) and directly in the code and nothing is executed.

This change resolves #510

pgorod · 2024-08-30T14:28:48Z

If there's double-encoding, shouldn't it be fixed by preventing the second encoding, instead of adding a second decode?

I'm very concerned with the way that the craziness of v7 misguided/excessive purification is creeping into v8 too... we really really really must not just patch "undo" code on top of "let's mess things up" code. Twice.

This is not a criticism of the current PR, it is a wider issue... still, we shouldn't merge just because it solves the current problem, we should really check that we're doing things the right way.

chris001 · 2024-09-01T05:25:49Z

The code paths could be traced, and places where double encoding happens should be eliminated, which could be difficult.

Or, the PHP string could be encapsulated inside a SuiteString class that adds methods to get the various encoded versions of the original UTF-8 string - for the web interface, etc. Problem solved.

GoryMoon · 2024-09-02T08:15:30Z

Just documenting the double encoding reason, it's needed for the specific cases below.
I agree that it's not a good solution and should be prevented instead of going back and forth.

Without the double decoding the description looks like this:

Adding one decode it will look like this, the part that is double encoded looks to be non html/xml block.

And with both decodes it looks like this:

This issue was introduced in f483bec for the v8.6.1 release to fix what looks like a bunch of CVEs.
Not sure it there was a specific case for the TextMapper or it if was applied to anything that might cause it. That makes me a bit hesitant to change too much without knowing the reason.

serhiisamko091184 added the Status:Requires Updates Issues & PRs which requires input or update from the author label Nov 27, 2023

yunusyerli1 mentioned this pull request Jun 13, 2024

Fix #274 - Fix Views HTML Entities Which Encode UTF Characters Incorrectly #403 #501

Open

6 tasks

yunusyerli1 self-assigned this Jun 20, 2024

yunusyerli1 added Status: Passed Code Review and removed Status:Requires Updates Issues & PRs which requires input or update from the author labels Jun 20, 2024

GoryMoon force-pushed the bugfix_274 branch from 46c8af4 to cc74da5 Compare August 27, 2024 14:25

GoryMoon force-pushed the bugfix_274 branch from cc74da5 to cbf664c Compare August 27, 2024 14:29

Fix salesagility#274 - Views Incorrectly Encode UTF Characters as HTM…

c1edbab

…L Entities

GoryMoon force-pushed the bugfix_274 branch from cbf664c to c1edbab Compare August 30, 2024 12:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix #274 - Views Incorrectly Encode UTF Characters as HTML Entities #403

Fix #274 - Views Incorrectly Encode UTF Characters as HTML Entities #403

GoryMoon commented Nov 26, 2023 •

edited

Loading

serhiisamko091184 commented Nov 27, 2023

pgorod commented Nov 27, 2023

GoryMoon commented Nov 27, 2023

SuiteBot commented Aug 27, 2024 •

edited

Loading

GoryMoon commented Aug 27, 2024

GoryMoon commented Aug 30, 2024

pgorod commented Aug 30, 2024

chris001 commented Sep 1, 2024

GoryMoon commented Sep 2, 2024

	if (is_array($toHTML)) {
	$string = str_ireplace($GLOBALS['toHTML_keys'], $GLOBALS['toHTML_values'] ?? [], $string);
	} else {
	$string = htmlentities($string, ENT_HTML401\|ENT_QUOTES, 'UTF-8');

Fix #274 - Views Incorrectly Encode UTF Characters as HTML Entities #403

Are you sure you want to change the base?

Fix #274 - Views Incorrectly Encode UTF Characters as HTML Entities #403

Conversation

GoryMoon commented Nov 26, 2023 • edited Loading

Description

Motivation and Context

How To Test This

Types of changes

Final checklist

serhiisamko091184 commented Nov 27, 2023

pgorod commented Nov 27, 2023

GoryMoon commented Nov 27, 2023

SuiteBot commented Aug 27, 2024 • edited Loading

GoryMoon commented Aug 27, 2024

GoryMoon commented Aug 30, 2024

pgorod commented Aug 30, 2024

chris001 commented Sep 1, 2024

GoryMoon commented Sep 2, 2024

GoryMoon commented Nov 26, 2023 •

edited

Loading

SuiteBot commented Aug 27, 2024 •

edited

Loading